35 research outputs found

    Differentiable molecular simulation can learn all the parameters in a coarse-grained force field for proteins

    Finding optimal parameters for force fields used in molecular simulation is a challenging and time-consuming task, partly due to the difficulty of tuning multiple parameters at once. Automatic differentiation presents a general solution: run a simulation, obtain gradients of a loss function with respect to all the parameters, and use these to improve the force field. This approach takes advantage of the deep learning revolution whilst retaining the interpretability and efficiency of existing force fields. We demonstrate that this is possible by parameterising a simple coarse-grained force field for proteins, based on training simulations of up to 2,000 steps learning to keep the native structure stable. The learned potential matches chemical knowledge and PDB data, can fold and reproduce the dynamics of small proteins, and shows ability in protein design and model scoring applications. Problems in applying differentiable molecular simulation to all-atom models of proteins are discussed along with possible solutions and the variety of available loss functions. The learned potential, simulation scripts and training code are made available at https://github.com/psipred/cgdms

    Recent Developments in Deep Learning Applied to Protein Structure Prediction

    Although many structural bioinformatics tools have been using neural network models for a long time, deep neural network (DNN) models have attracted considerable interest in recent years. Methods employing DNNs have had a significant impact in recent CASP experiments, notably in CASP12 and especially CASP13. In this article, we offer a brief introduction to some of the key principles and properties of DNN models and discuss why they are naturally suited to certain problems in structural bioinformatics. We also briefly discuss methodological improvements that have enabled these successes. Using the contact prediction task as an example, we also speculate why DNN models are able to produce reasonably accurate predictions even in the absence of many homologues for a given target sequence, a result which can at first glance appear surprising given the lack of input information. We end on some thoughts about how and why these types of models can be so effective, as well as a discussion on potential pitfalls

    Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins

    Deep learning-based prediction of protein structure usually begins by constructing a multiple sequence alignment (MSA) containing homologs of the target protein. The most successful approaches combine large feature sets derived from MSAs, and considerable computational effort is spent deriving these input features. We present a method that greatly reduces the amount of preprocessing required for a target MSA, while producing main chain coordinates as a direct output of a deep neural network. The network makes use of just three recurrent networks and a stack of residual convolutional layers, making the predictor very fast to run, and easy to install and use. Our approach constructs a directly learned representation of the sequences in an MSA, starting from a one-hot encoding of the sequences. When supplemented with an approximate precision matrix, the learned representation can be used to produce structural models of comparable or greater accuracy as compared to our original DMPfold method, while requiring less than a second to produce a typical model. This level of accuracy and speed allows very large-scale three-dimensional modeling of proteins on minimal hardware, and we demonstrate this by producing models for over 1.3 million uncharacterized regions of proteins extracted from the BFD sequence clusters. After constructing an initial set of approximate models, we select a confident subset of over 30,000 models for further refinement and analysis, revealing putative novel protein folds. We also provide updated models for over 5,000 Pfam families studied in the original DMPfold paper

    A guide to machine learning for biologists

    The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed

    AlloPred: prediction of allosteric pockets on proteins using normal mode perturbation analysis

    BACKGROUND: Despite being hugely important in biological processes, allostery is poorly understood and no universal mechanism has been discovered. Allosteric drugs are a largely unexplored prospect with many potential advantages over orthosteric drugs. Computational methods to predict allosteric sites on proteins are needed to aid the discovery of allosteric drugs, as well as to advance our fundamental understanding of allostery. RESULTS: AlloPred, a novel method to predict allosteric pockets on proteins, was developed. AlloPred uses perturbation of normal modes alongside pocket descriptors in a machine learning approach that ranks the pockets on a protein. AlloPred ranked an allosteric pocket top for 23 out of 40 known allosteric proteins, showing comparable and complementary performance to two existing methods. In 28 of 40 cases an allosteric pocket was ranked first or second. The AlloPred web server, freely available at http://www.sbg.bio.ic.ac.uk/allopred/home, allows visualisation and analysis of predictions. The source code and dataset information are also available from this site. CONCLUSIONS: Perturbation of normal modes can enhance our ability to predict allosteric sites on proteins. Computational methods such as AlloPred assist drug discovery efforts by suggesting sites on proteins for further experimental study

    Antiinflammatory Therapy with Canakinumab for Atherosclerotic Disease

    Background: Experimental and clinical data suggest that reducing inflammation without affecting lipid levels may reduce the risk of cardiovascular disease. Yet, the inflammatory hypothesis of atherothrombosis has remained unproved. Methods: We conducted a randomized, double-blind trial of canakinumab, a therapeutic monoclonal antibody targeting interleukin-1β, involving 10,061 patients with previous myocardial infarction and a high-sensitivity C-reactive protein level of 2 mg or more per liter. The trial compared three doses of canakinumab (50 mg, 150 mg, and 300 mg, administered subcutaneously every 3 months) with placebo. The primary efficacy end point was nonfatal myocardial infarction, nonfatal stroke, or cardiovascular death. Results: At 48 months, the median reduction from baseline in the high-sensitivity C-reactive protein level was 26 percentage points greater in the group that received the 50-mg dose of canakinumab, 37 percentage points greater in the 150-mg group, and 41 percentage points greater in the 300-mg group than in the placebo group. Canakinumab did not reduce lipid levels from baseline. At a median follow-up of 3.7 years, the incidence rate for the primary end point was 4.50 events per 100 person-years in the placebo group, 4.11 events per 100 person-years in the 50-mg group, 3.86 events per 100 person-years in the 150-mg group, and 3.90 events per 100 person-years in the 300-mg group. The hazard ratios as compared with placebo were as follows: in the 50-mg group, 0.93 (95% confidence interval [CI], 0.80 to 1.07; P = 0.30); in the 150-mg group, 0.85 (95% CI, 0.74 to 0.98; P = 0.021); and in the 300-mg group, 0.86 (95% CI, 0.75 to 0.99; P = 0.031). The 150-mg dose, but not the other doses, met the prespecified multiplicity-adjusted threshold for statistical significance for the primary end point and the secondary end point that additionally included hospitalization for unstable angina that led to urgent revascularization (hazard ratio vs. placebo, 0.83; 95% CI, 0.73 to 0.95; P = 0.005). Canakinumab was associated with a higher incidence of fatal infection than was placebo. There was no significant difference in all-cause mortality (hazard ratio for all canakinumab doses vs. placebo, 0.94; 95% CI, 0.83 to 1.06; P = 0.31). Conclusions: Antiinflammatory therapy targeting the interleukin-1β innate immunity pathway with canakinumab at a dose of 150 mg every 3 months led to a significantly lower rate of recurrent cardiovascular events than placebo, independent of lipid-level lowering. (Funded by Novartis; CANTOS ClinicalTrials.gov number, NCT01327846.)

    The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar, and APOGEE-2 Data

    This paper documents the seventeenth data release (DR17) from the Sloan Digital Sky Surveys; the fifth and final release from the fourth phase (SDSS-IV). DR17 contains the complete release of the Mapping Nearby Galaxies at Apache Point Observatory (MaNGA) survey, which reached its goal of surveying over 10,000 nearby galaxies. The complete release of the MaNGA Stellar Library accompanies this data, providing observations of almost 30,000 stars through the MaNGA instrument during bright time. DR17 also contains the complete release of the Apache Point Observatory Galactic Evolution Experiment 2 survey that publicly releases infrared spectra of over 650,000 stars. The main sample from the Extended Baryon Oscillation Spectroscopic Survey (eBOSS), as well as the subsurvey Time Domain Spectroscopic Survey data were fully released in DR16. New single-fiber optical spectroscopy released in DR17 is from the SPectroscopic IDentification of ERosita Survey subsurvey and the eBOSS-RM program. Along with the primary data sets, DR17 includes 25 new or updated value-added catalogs. This paper concludes the release of SDSS-IV survey data. SDSS continues into its fifth phase with observations already underway for the Milky Way Mapper, Local Volume Mapper, and Black Hole Mapper surveys

    Predicting protein dynamics and allostery using multi-protein atomic distance constraints

    The related concepts of protein dynamics, conformational ensembles and allostery are often difficult to study with molecular dynamics (MD) due to the timescales involved. We present ExProSE (Exploration of Protein Structural Ensembles), a distance geometry-based method that generates an ensemble of protein structures from two input structures. ExProSE provides a unified framework for the exploration of protein structure and dynamics in a fast and accessible way. Using a dataset of apo/holo pairs it is shown that existing coarse-grained methods can often not span large conformational changes. For T4-lysozyme ExProSE is able to generate ensembles that are more native-like than tCONCOORD and NMSim, and comparable to targeted MD. By adding additional constraints representing potential modulators, ExProSE can predict allosteric sites. ExProSE ranks an allosteric pocket first or second for 27 out of 58 allosteric proteins, which is similar and complementary to existing methods. The ExProSE source code is freely available

    BioStructures.jl: read, write and manipulate macromolecular structures in Julia

    Summary: Robust, flexible and fast software to read, write and manipulate macromolecular structures is a prerequisite for productively doing structural bioinformatics. We present BioStructures.jl, the first dedicated package in the Julia programming language for dealing with macromolecular structures and the Protein Data Bank. BioStructures.jl builds on the lessons learned with similar packages to provide a large feature set, a flexible object representation and high performance. Availability and implementation: BioStructures.jl is freely available under the MIT license. Source code and documentation are available at https://github.com/BioJulia/BioStructures.jl. BioStructures.jl is compatible with Julia versions 0.6 and later and is system-independent